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Abstract 

Approximate Bayesian computation (ABC) has become an essential tool for the anal- 
ysis of complex stochastic models when the likelihood function is numerically unavailable. 
However, the well-established statistical method of empirical likelihood provides another 
route to such settings that bypasses simulations from the model and the choices of the 
ABC parameters (summary statistics, distance, tolerance), while being convergent in the 
number of observations. Furthermore, bypassing model simulations may lead to signif- 
icant time savings in complex models, for instance those found in population genetics. 
The BCci algorithm we develop in this paper also provides an evaluation of its own per- 
formance through an associated effective sample size. The method is illustrated using 
several examples, including estimation of standard distributions, time series, and popu- 
lation genetics models. 

Keywords: Bayesian statistics — likelihood-free methods — empirical likelihood — 
population genetics — robust statistics 

1 Introduction 

Bayesian statistical inference cannot easily operate when the likelihood function associated 
with the data is not entirely known, or cannot be computed in a manageable time, as is the 
case in most population genetic models (JU EJ |3j) . The fundamental reason for this difficulty 
with population genetics is that the statistical model associated with coalescent data needs 
to integrate over trees of high complexity. Similar computational issues with the likelihood 
function often occur in hidden Markov and other dynamic models (j3J). In those settings, 
traditional approximation tools based on stochastic simulation (JSJ) are unavailable or unreli- 
able. Indeed, the complexity of the latent structure defining the likelihood makes simulation 
of such structures too unstable to be trusted. Such settings call for alternative and often 
cruder approximations. The ABC methodology (HJ |6j) is a popular solution that bypasses the 
computation of the likelihood function (see © and ([8]) for surveys); (j9j) validate a conditional 
version of ABC that applies to hierarchical Bayes models in a wide generality. 

The fast and polytomous development of the ABC algorithm is indicated by the rising 
literature in the domain, at both the methodological and the application levels. For instance, 
a whole new area of population genetic modelling (flOl EJ) has been explored thanks to the 
availability of such methods. However, both practitioners and theoreticians show a reluctance 
in adopting ABC, as some doubt about the validaty of the method pT] [T2 l [T3|) . We propose 
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in this paper to supplement the ABC approach with a generic and convergent likelihood 
approximation called the empirical likelihood that validates the new Bayesian computational 
technique as a convergent inferential method when the number of observations grows to 
infinity. The empirical likelihood perspective, introduced by (HU), is a robust statistical 
approach that does not require the specification of the likelihood function. However, while 
it does not appear to have been used before in the settings that now rely on ABC, this 
data analysis method also is a broadly (albeit not universally) applicable and often fast 
approach which approximation differs from the one found in ABC algorithms, even though 
both are rooted in non-parametric statistics. Therefore, this methodology can be used both 
as a solution per se and as a benchmark against which to test the ABC output in many 
cases. This paper introduces the BC i algorithm and illustrates its performances on selected 
representative examples, comparing the outcome with the true posterior density whenever 
available, and with an ABC approximation (|15p otherwise. 

2 Statistical Methods 

2.1 The ABC algorithm 

The primary purpose of the ABC algorithm is to approximate simulation from the center- 
piece of Bayesian inference, the posterior distribution 7r(0|y) oc 7v(6)f(y\0), when it cannot 
be numerically computed but when the distributions corresponding to both the prior ir and 
the likelihood / can be simulated by manageable computer devices. The original (JB) ABC 
algorithm is as follows: given a sample y of observations from the sample space, a sample of 
parameters (Ox, . . . , 6 m) is produced by 

Algorithm 1: ABC sampler 

for i = 1 to M do 
repeat 

Generate 6' from the prior distribution n(-) 

Generate z from the likelihood f(-\& ) 
until jo{r?(z),r/(y)} < e 
set 6i = 0', 
end for 

The parameters of the ABC algorithm are the summary statistic r/, the distance p{-,-} 
and the tolerance level e > 0. The basic justification of the ABC approximation is that, 
when using a sufficient statistic rj, the distribution of the 6^s in the output of the algorithm 
converges to the genuine posterior distribution when e goes to zero f|16|) . 

In practice, however, the statistic r] is non-sufficient and at best the approximation then 
converges to the genuine posterior 7r(0|r/(y)) when e goes to zero. This loss of information 
seems to be a necessary price to pay for the access to computable quantities. Furthermore, as 
argued below, it can be evaluated against the empirical likelihood approximation when the 
latter is available. Indeed, this approach does not require an information reduction through 
the choice of a tolerance zone or of a non-sufficient summary statistic. 

2.2 Empirical likelihood 

Owen (|14p developed empirical likelihood techniques as a robust alternative to classical like- 
lihood approaches. He demonstrated that, for some categories of statistical models, this 
approach inherited the convergence properties of standard likelihood at a much lower cost in 
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assumptions about the model (as detailed in SI). While ABC algorithms do require a fully 
defined and often complex (hence debatable) statistical model, we argue that one should 
take advantage of the approximation device provided by the empirical likelihood to overcome 
most of the calibration difficulties encountered by ABC, at least as a convenient benchmark 
against which to test ABC solutions. 

Assume that the dataset y is composed of n independent replicates y = (yi, . . . ,y n ) of 
some random vector Y with density /. Rather than defining the likelihood from the density 
/ as usual, the empirical likelihood method starts by defining parameters of interest, 0, as 
functionals of /, for instance as moments of /, and it then profiles a non-parametric likelihood. 
More precisely, given a set of constraints of the form 

E F [h(Y,e)] = 0, (1) 

where the dimension of h sets the number of constraints unequivocally defining 6, the em- 
pirical likelihood is defined as 

n 

L e i(.9\y) = max T\pi (2) 
p f-h 

for p in the set {p 6 [0; l] n , Y^Pi = 1> ^2iPih{yi-,0) = 0}. For instance, in the one- 
dimensional case when 6 = E#[Y], the empirical likelihood in 8 is the maximum of the 
product pi ■ ■ ■ p n under the constraint piyi + . . . + PnVn = 0. (Solving pi) is done using the 
R package 'emplik' developed by (|17|) and based on the Newton-Lagrange algorithm.) When 
the data are not iid, an underlying iid structure may sometimes be exploited, as illustrated 
in the dynamic model section below. However, this is not always the case, meaning that the 
empirical likelihood method remains out of reach in some complex cases when ABC can still 
be implemented. Furthermore, as pointed out in the SI by a quote from Owen, the validation 
of the approach depends on a choice of the set of constraints that ensures convergence. 

While the convergence of the empirical likelihood is well-established (see SI and (|18|) ). 
the Bayesian use of empirical likelihoods has been little examined in the past, apart from a 
Monte Carlo study in (Tl9|) . and a probabilistic one in ([20]) . 

2.3 BCei 

The most natural use of the empirical likelihood approximation is to act as if this represen- 
tation was an exact likelihood, as in (|19p . Incorporating this perspective into a basic sampler 
leads to the following algorithm: 
Algorithm 2: Basic BC e i sampler 
for i = 1 to M do 

Generate 6i from the prior distribution n(-) 
Set the weight Ui = L e i(Bi\y) 
end for 

The output of BC e i is a sample of size M of parameters with associated weights, which 
operate as an importance sampling output (5). Thus, the performance of the algorithm can 
be evaluated through the effective sample size 




which approximates the size of an iid sample with the same variance as the original sample. 
As shown in (|2ip . this quantity is always between 1 (corresponding to a very poor outcome) 
and M (corresponding to an iid perfect outcome). 
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Any algorithm that samples from a posterior distribution (e.g., MCMC, Population Monte 
Carlo, SMC algorithms, see ([5])) may instead use the empirical likelihood as a proxy to the 
exact likelihood. For instance, to speed up the computation in the population genetics model 
introduced below, we resorted to the adaptive multiple importance sampling (AMIS, (f22|) ) 
which is easy to parallelize on a multi-core computer. While the original target distribution 
is ir(0)L(O\y) and the AMIS algorithm uses several (multivariate) Student's t distributions, 
denoted i3(-|m, X) (i.e., with three degrees of freedom, centered at mean m and with covari- 
ance matrix S), as an importance sampling distribution, the algorithm can be adapted to 
the empirical likelihood in a straightforward manner: 

Algorithm 3: BC e i-AMIS sampler 

for % = 1 to M do 

Generate On from the prior distribution qi(-) 

Set = L e i(0i t i\y) 
end for 

for t = 2 to T M do 

Compute weighted mean mj and weighted variance matrix of the s j (1 < s < t — 1, 
1 < i < M). 

Denote qt(-) the density of t3(-\m t , St), 
for i = 1 to M do 

Generate 0^ from q t (-) . 

Set u t> i = ir(O t ^L e i(O t ,i\y) / £«=i q s (Ot,i) 
end for 

for r = 1 to t — 1 do 
for % = 1 to M do 

Update the weight of Q r i as 
u) r ,i = K(O t ,i)L el (0 r! i\y) / ^1=] Qs{O r ,i) 
end for 
end for 
end for 

The output is thus a weighted sample 6 t: i of size MTu- 

In contrast with ABC, BC c i algorithms do not usually require simulations from the sam- 
pling model, given that ^ provides a converging and non-parametric approximation of the 
likelihood function. This feature thus induces very significative improvements in computing 
time when the production of pseudo-datasets is time consuming, since solving is usually 
immediate. This is for instance the case in population genetics and the last section of the SI 
provides an illustration of a huge improvement in comparison with ABC in two experiments 
described below. However, the improvement in speed may vanish in cases when producing an 
iid structure connected with the constraint ([!]) requires simulations from the sampling model, 
as illustrated by a counter-example for point processes in the SI, BC i and ABC then break- 
ing even in terms of computing time. Even though the computing time required by BC e i is 
customarily negligible when compared with ABC (or does not induce any extra time as in the 
point process counter-example), we further caution against opposing both approaches solely 
based on computing times, since they differ in the approximations they provide to a genuine 
Bayesian analysis and thus should be used in conjunction. 

Using empirical likelihoods means there is no calibration of the many tuning parameters of 
ABC algorithms; most significantly, the likelihood ratio acts as a natural distance and impor- 
tance weights produce an implicit and self-defined quantile on the original sample simulated 
from the prior. Notwithstanding these appealing qualities, BC e i still requires calibration, 
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in particular in the choices of the parameterization of the sampling distribution and of the 
corresponding constraints ([I]) defining the empirical likelihood. Some examples are discussed 
below. The BC c i-AMIS sampler also implies computing values of the prior density, up to a 
constant, which may be an hindrance in peculiar cases. 

2.4 Composite likelihood in population genetics 

ABC was first introduced by population geneticists (f2"lllU[[6"|) interested in statistical inference 
about the evolutionary history of species, as no likelihood-based approach existed apart from 
very rudimentary and hence unrealistic situations. This approach has been used in a number 
of biological studies ()23l 124} 1251) . most of them including model choice. It is therefore crucial 
to obtain insights into the validity of such studies, particularly when they have economic, 
biological or ecological consequences (see, e.g., (26)). This can be achieved in part by running 
a comparison using BC i. Furthermore, given the major gain in computing time, due to the 
absence of replications of the data, BC e i can be applied to more complex biological models. 

The main caveat when using the empirical likelihood in such settings is selecting a con- 
straint ([!]) on the parameter of interest: in phylogeography, parameters like divergence dates, 
effective population sizes, mutation rates, etc., cannot be expressed as moments of the sam- 
pling distribution at a given locus. In particular, the data are not iid. However, when 
considering microsatellite loci with the stepwise mutation model (|27p and evolutionary sce- 
narios composed of divergence, we can derive the pairwise composite scores whose zero is the 
pairwise maximum likelihood estimator. Composite likelihoods have been proved consistent 
for estimating recombination rates, introducing an approximation of the dependency struc- 
ture between nearby loci (|28l \29\ \30\ I3T1) . (See also (j32j) for composite likelihoods used in a 
likelihood- free setting.) 

More specifically, we are approximating the intra-locus likelihood by a product over all 
pairs of genes in the sample at a given locus. Assuming that yf denotes the allele of the 
i-th gene in the sample at the k-th locus, and that (ft is the vector of parameters, then the 
so-called pairwise likelihood of the data at the k-th locus, namely y k , is defined by 

My fe |0) = n^ fc ,^) 

i<j 

and the corresponding pairwise score function is log £2 (y k \ <ft) ■ Pairwise score equations 

E f \y 4t ]og£ 2 (y\<f>)] = o 

provide a constraint ([!]) in every way comparable to the score equations that give the max- 
imum likelihood estimate and which is quite powerful for empirical likelihood derivations 
( (|18p . pp. 48-50). Hence the empirical likelihood of the full dataset y = (y 1 , . . . ,y K ) given 
4> is computed with Q under the (multidimensional) constraint that 

K 

5>fcV log^ 2 (y fc |(/>) = O. 

k=l 

When the effective population size is identical over all populations of the demographic 
scenario, the time axis may be scaled so that coalescence of two genes in Kingman's genealogy 
occurs with rate k{k — l)/2 if there are k lineages. In this modified scale, mutations at a 
given locus arise with rate 6/2 along the gene genealogy. Our mutation model is the simple 
stepwise mutation model of (|27p . i.e. the number of repeats of the mutated gene increases or 
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decreases by one unit with equal probability. Given two microsatellite allelic states x\ and 
X2, their pairwise likelihood ^2(^1, X2\4>) depends only on the difference of the states x\ — x 2 . 
If both genes belong to individuals that lie in the same deme, then (see SI and (33)) 

hlxxMt) = p(0) lX2 ~ xll /Vi + 2e 

where p{9) = 9 / (l + 9 + \f\ + 2$) . If the two genes belong to individuals from demes having 
diverged at time r, then (j33|) 

T +00 

£ 2 (x 1 ,x 2 \cf>) = : -=== Y, PV) W I\xi-x>\-k(r9) 

k=— 00 

where Is(z) denotes the <5th-order modified Bessel function of the first kind evaluated at z. 
Computing the pairwise scores, i.e. partial derivatives of log li{x\ % X2\4>) from those equations, 
is straightforward, by virtue of recurrence properties of the Bessel functions. Algorithm 
BC e i is therefore directly available in this setting, and furthermore at a cost much lower than 
the one associated with ABC algorithms (Table SI). 



3 Results 

3.1 Normal distribution 

Starting with the benchmark of a normal distribution with known variance (equal to one), 
we can check that the empirical likelihood allows for a proper recovery of the true posterior 
distribution on the mean. Fig. SI shows that a constraint ([!]) based on the mean works well, as 
do the two constraints on mean and second central moment, E[(X — 9) 2 ] = (Figure S2). On 
the other hand, using the three first central moments in the empirical likelihood may degrade 
the fit (three cases in Fig. S3). While this poor fit is not signaled by the ESS (which is often 
larger than in Fig. S1-S2, because of the growing disconnection between the approximation 
and the true likelihood and hence a more uniform range of the weights), a parallel run of the 
method with different collections of constraints does detect the discrepancy. This illustrates 
the variability of the empirical likelihood approximation, as well as its sensitivity to the 
choice of defining constraints. While a drawback of the method, this variability can be 
tested and evaluated by comparing outcomes, due to often limited computing costs. This toy 
experiment also supports the generic recommendation (18) to keep the number of constraints 
and parameters equal. 



3.2 Quantile distributions 

Quantile distributions are defined by a closed-form quantile function F~ 1 (p; 9), and generally 
have no closed form for the density function. They are of great interest because of their 
flexibility and the ease with which they can be simulated by a simple inversion of the uniform 
distribution. A range of methods, including ABC approaches (|10p . have been proposed for 
estimation (see SI). We focus here on the four-parameter g-and-k distribution, defined by its 
quantile function, denoted Q{r; A, B, g, k) and equal to 

A I B (l I c 1 ~ e M-9*(r)) 
V l + exp(-gz(r)) 

where z(r) is the rth standard normal quantile; the parameters A,B,g and k represent 
location, scale, skewness and kurtosis, respectively and c measures the overall asymmetry 
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(134} I35p . We evaluated the BC e i algorithm for estimating this distribution using two values 
of 9 = (A, B, g, k), two sets of priors and various combinations of n, M and p, where p is the 
number of percentiles used as constraints (see details in SI). 

Figure [T] illustrates the true and fitted curves and a 95% credible region for the case with 
n = 100, M = 5000 and p = 3. The corresponding posterior means (standard deviations) for 
the parameters A,B,g, k were 3.08(0.14), 1.12(0.23), 1.79(0.25), 0.41(0.12), respectively. The 
choice of sample size and number of constraints did not substantively affect the accuracy of 
parameter estimates, but the precision was noticeably improved for the larger sample size; 
see Figures S4, S5, and S6. 



Figure 1: True (black) and fitted (brown) cdf functions with a pointwise 95% credible 
(shaded grey) region centered on the fitted cdf for a dataset of n = 100 observations from 
the g-and-k distribution, based on M = 10 3 simulations of BC e i. 

The accuracy and precision of the estimates were broadly comparable with the results 
obtained by (|36p for the same distribution. Based on the whole experiment, the parameters 
A and B were well estimated in all cases, while the estimates of g and k were poorer for smaller 
values of n and M. For small n the estimates were more subject to the vagaries of sampling 
variation, whereas for small M they were subject to the influence of a smaller number of very 
large importance weights. However, given the speed of BC c i compared with competing ABC 
algorithms, it is feasible to use even larger values of M than considered in this experiment, 
since there is no requirement to simulate new datasets at each iteration. Moreover, this 
experiment is based on the very basic case of sampling from the prior; the results would be 
further improved by using an analogue of BC e i-AMIS or alternative approaches similar to 
those proposed by (f37|) for ABC. 

3.3 Dynamic models 

In dynamic models, the difficulty with empirical likelihood stems from the dependence in the 
data (yt)i<t<T- However, these models can be represented as transforms of unobserved iid 
sequences {tt)i<t<T- The recovery of a converging empirical likelihood representation thus 
requires the reconstitution of the ej's as transforms of the data y and of the parameter 9. 
Independence between the ej's is then at least as important as moment conditions. (This 
implies equivalent computing times for ABC and BC c i.) 

For instance, consider a simple dynamic model, namely the ARCH(l) model: 

y t = a t e t , e t ~ Af(0, 1) , of = a + aiy t 2 _i , 

with a uniform prior over the simplex, i.e., ao, a± > 0, ao + ai < 1. While this model can be 
handled by other means, since the likelihood function is available, we will compare here the 
behaviour of ABC and BC c i algorithms. 
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First, a natural empirical likelihood representation is based on the reconstituted et's, 
defined as yt/crt when the ot's are derived recursively. Figure [2] shows the result of estimating 
both parameters ao and ot\ when Algorithm ABC uses as summary statistics either the least 
square estimates of the parameters (derived from the series (yt)), which we label "optimal 
ABC" in connection with (|38p . or the mean of the series log(y?) supplemented by the two 
first autocorrelations of the series (yf ). The constraints in the empirical likelihood are either 
based on the three first moments of the reconstituted e^'s or on the variance of those et's 
complemented by both the correlations between the yt—is and the et's and between the 
£i_i's and the et's. As seen from this experiment, BC e i does as well as the optimal ABC for 
the estimation of the parameters, but further brings a reduction in the variability of those 
estimates, thanks to the importance weights. 

r _l -j- 
§ s 



"SBC SBC ABQJLL ABQJLL 
opti. mom. mom. corr. 



~mC SEC ABIUbL AB<DbL 
opti. mom. mom. corr. 




Figure 2: Comparison of ABC evaluations of posterior expectations (top, with true values 
in dashed lines) and posterior variances (bottom) of the parameters (ao, a\) of the ARCH(l) 
model with 100 observations. The first two columns correspond to two choices of summary 
statistics for the ABC algorithm (least squares estimates and mean of the logo's plus auto- 
correlations of order 2 and 3, respectively). The last two columns correspond to two sets of 
constraints for the BC e i alternative (first three moments and second moment plus autocor- 
relation of order 1 plus correlation with previous observation for the reconstituted et's). All 
experiments are based on the same reference table of 10 simulations, with the tolerance e 
chosen as the 1% quantile of the distances. 

A much more complex dynamic model is the GARCH(1, 1) model of (|39p that can be 
formalized as the observation of yt ~ A/"(0, of) when 

of = a + ai2/t_i + /W-i ( 3 ) 

under the constraints ao, a±, (3\ > and a\ < 1, that is, yt = o~tet- Given the constraints 
on the parameters, a natural prior is to choose an exponential distribution on ao, for instance 
an exponential £xp(l) distribution, and a Dirichlet 1)3(1, 1, 1) on (a\,(3i, 1 — a.\ — /3i). An 
ABC approach requires the choice of summary statistics, which are necessarily non-sufficient 
since the model is a state-space model. Following (|38p . we use the maximum likelihood 
estimator as summary statistics, relying on the R function garch for its derivation despite 
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its lack of stability. Since (j40p derived natural score constraints for the empirical likelihood 
associated with this model, we used their constraints to build our BC c i algorithm. Fig. [3] 
provides a comparison of both approaches with the MLE. It shows in particular that the ABC 
algorithm is unable to produce acceptable inference in this case, even in the most favorable 
case when it is initialized at a satisfactory maximum likelihood estimate (as shown by the 
bottom row). The BC e i algorithm is performing better, even though it fails to catch the 
correct range of f}\. 



a 



a6c BSei mLe 



a 



a6c bS mLe 



Afc B5ei MLE 



Figure 3: Comparison of evaluations of posterior expectations (with true values in dashed 
lines) of the parameters {olq, a%, Pi) of the GARCH(l) model with 250 observations. The first 
row corresponds to an optimal ABC algorithm (using the MLE as summary statistic and with 
the tolerance e chosen as the 5% quantile of the distances), the second row corresponds to 
the BC c i algorithm based on the constraints derived in (|4U|) . and the third row corresponds 
to the MLE derived by the R procedure garch initialized at the true parameter value. 



Another type of non-iid model relying on the superposition of an unknown number of 
gamma point processes and processed in (41) through a (non-Bayesian) alternative to ABC 
is discussed in the SI as an additional illustration of the possibilities of the empirical likelihood 
perspective for complex models, offering a free benchmark for evaluating the ABC outcome. 
Figure S7 shows a clear improvement brought by BC e i over the corresponding ABC outcome. 



3.4 Population genetics 

We compare our proposal with the reliable ABC-based estimates given by ©. We set up 
two toy experiments that are designed to defeat ABC, using pseudo observed data. The two 
evolutionary scenarios are given in Figure |4j In all experiments, we only consider microsatel- 
lite loci and assume that the effective population size is identical over all populations of the 
scenario. 




pop o POP 1 pop o POP 1 POP 2 
1st experiment 2nd experiment 



Figure 4: Evolutionary scenarios of the two experiments in population genetics. 
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Figure 5: Comparison of the original ABC (curve) with the histogram of the simulation 
from the BC e i~AMIS sampler in the case of the population genetics model given in Scenario 
A, based on uniform priors on (log lo (0), log 10 (r)) on (—1, 1.5) x (—1, 1) and 10 4 particles. 



In the first experiment, we consider two populations which diverged at time r in the past, 
see Figure [4] (left). Our pseudo observed datasets are made of thirty diploid individuals per 
population genotyped at a hundred independent loci. We compare the marginal posterior 
distributions of the unknown parameters 9 and r computed with the ABC method (using 
the DIY-ABC software of (|42p ) and with the BC e i-AMIS sampler. In this case, results are 
improved when the ^-component of the composite scores, namely dg log £2(T>\(j)), is restricted 
to the sum over all pairs of genes lying in the same population. Otherwise, as can be checked 
via a quick simulation experiment, BC e i systematically under-estimates 9. Figure [5] shows 
the typical discrepancy between both results: ABC and BC e i agree on the mutation rate 9, 
but the BC e i estimation of r is more accurate, see also Table 1. 

In the second experiment, we consider three populations, see Figure|4] (right): the last two 
populations diverged at time t\ and their common ancestral population diverged from the 
first population at time T2- The sample comprises thirty diploid individuals per population 
genotyped at a hundred independent loci. In contrast to the first experiment, all components 
of the composite scores are computed here by summing over all pairs of genes whatever the 
population to which they belong. The results given in Table 1 show that ABC and BC e i 
mainly agree on both parameters 9 and t\, but BC e i is slightly more accurate than ABC on 

T2- 

Table SI gives a comparison of the computing times for both algorithms, showing the 
difference of magnitudes between them. This is due to the simulation of the the simulated 
datasets for ABC: While this difference should not be over-interpreted, it signals a potential 
for self- assessment and testing that is missing for ABC methods. 



10 



Table 1: Comparison of the original ABC and BC c i on 100 Monte Carlo replicates. We 
use two point estimates of the parameters: (1) posterior mean and (2) posterior median, 
and measured the error between the estimation and the "true" value used to generate the 
observation with (1) the root mean square error in the case of the posterior mean and (2) 
the median absolute deviation in the case of the posterior median. We also compare credible 
intervals (with probability 0.8) through the proportion of Monte Carlo replicates in which 
the "true" value falls into this interval. 



First experiment 





Root Mean Square Error 


Median Absolute Deviation 


Coverage of the credible 




of posterior mean 


of posterior median 


interval with probability 0.8 




ABC 


BCel 


ABC 


BCel 


ABC 


BCel 


e 


0.0971 


0.0949 


0.071 


0.059 


0.68 


0.81 


T 


0.315 


0.117 


0.272 


0.077 


1.0 


0.80 



Second experiment 





Root Mean Square Error 


Median Absolute Deviation 


Coverage of the credibility 




of posterior mean 


of posterior median 


interval of probability 0.8 




ABC 


BCel 


ABC 


BCel 


ABC 


BCel 


e 


0.0593 


0.0794 


0.0484 


0.0528 


0.79 


0.76 


n 


0.472 


0.483 


0.320 


0.280 


0.88 


0.76 


T2 


29.6 


4.76 


4.13 


3.36 


0.89 


0.79 



4 Discussion 

When compared with ABC methods, the (often) significant time savings provided by BC e i due 
to the lack of pseudo-sample simulation may open wider ranges for processing models involv- 
ing complex likelihoods. For instance, in population genetics, ABC is severely hindered by 
the time spent simulating a dataset when modelling isolation by distance in a continuously 
distributed population, or when studying a large set of SNP markers even on quite simple 
evolution scenarios. Moreover, when the dataset is composed of large sets of markers, the 
summary statistics proposed in ABC (in DIY-ABC, these are averages of some quantita- 
tive statistics over all loci) ignores some (statistical) information, while BC e i manages to 
recover most of it, more specifically to estimate divergence on large datasets. Improvements 
in accuracy of estimation and computational efficiency are also possible in other contexts as 
illustrated in the range of examples given above. 

Even when BC e i requires the same computing time as ABC, it uses the outcome in a 
very different perspective and provides a benchmark likelihood that helps in evaluating the 
pertinence of the ABC approximation, as illustrated in the gamma point process of SI. 

We acknowledge that a caveat of the empirical likelihood is that it requires a careful choice 
of the constraint ([!]). Those pivotal quantities have to be connected to the parameter in an 
identifying way, which may require complex manipulations as in the gamma process case 
or even be impossible. However, repeated experimentation is often available, as illustrated 
by the normal example and the population genetic experiments (where we computed the 
composite score on both a restricted set of pairs and all pairs of genes). Checking for the 
accuracy of the approximation means that a constraint in BC e ishould be tested on simulated 
datasets in controlled experiments where the true parameters are known, although much less 
than in ABC runs. Then we can test coverage of credibility intervals, and measure the error 
of various point estimates based on the output of the scheme. 
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Supplementary information (SI) 

Convergence of the empirical likelihood approximation 

The validation of the empirical likelihood approximation is provided by Theorem 3.4 of ([1]), which 
establishes an extension of Wilk's theorem to the EL likelihood ratio. (Note that n~ n is the maximum 
of L el .) 

Theorem Let X, Yy , . . . , Y n be independent random vectors with common distribution Fq . For 
9 e 0, let h(X,9) e R s . Let 9 e Q be such that V&r(h(Yi,9o)) is finite and has rank q > 0. If Qq 

satisfies E(h(X,9oj) — 0, then —2 log f cl ( °l — }_L_ — \ — nlj ^.2^ ^ n distribution when n — > oo. 

We also reproduce here an illuminating comment from Art Owen: "TVie interesting thing about 
Theorem 3.4 is what is not there. It includes no conditions to make 9 a good estimate of '9q, nor even 
conditions to ensure a unique value for 9q, nor even that any solution 9q exists. Theorem 3.4 applies 
in the just determined, over-determined, and under- determined cases. When we can prove that our 
estimating equations uniquely define 9q, and provide a consistent estimator 9 of it, then confidence 
regions and tests follow almost automatically through Theorem 3.4"- 

Pairwise composite likelihoods in population genetics 

We detail here the derivation of the composite likelihoods used for the version of the BC c i algorithm 
implemented in the case of the population genetics study. 

Two genes from the same deme First, we recall that we scale the time axis so that a pair of 
genes of the same deme coalesces at a random time with an exponential distribution with rate 1. We 
now consider a given locus and two microsatellite genes from our sample that come from the same 
deme. We denote their respective allelic state by x\ and xi- Their most recent common ancestor 
(MRCA) dates back to a time T, where T ~ £(1)- We assume that the mutation rate, namely 9/2 
does not vary along the whole history of our populations. Therefore, conditioned on T, the number of 
mutations between xi (i = 1, 2) and the MRCA is distributed according to a Poisson distribution with 
mean 9T/2. Hence, conditional on T, the number Ao of mutations between x% and X2 is a Poisson 
variable with mean 8T and 

Ejy^lT] =exp{(9T(w- 1)} . 



Thus 



E^ ] = / e -t exp{0t(u- 1)} dt 
Jo 

1 1/(1 + 0) 



1 + 0(1 -it) l-0/(l + 0)u' 

i.e., Nq ~ Qe(9/(1 + 0)), the geometric distribution with positive weight at 0. Finally, the difference 
between both genes is the accumulation over the Nq mutations, i.e., 

No 
k=l 
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where the e^'s are iid Rademachers (±1 with equal probability). Thus, 

N 



lfe=i 

oo 

= E[(cOSC)"°] = -p)(p COS 0" 

71=0 

since iVo ~ Ge{p) with p = 



1 

1 -p 1 



1-pcosC 1 + 0(1- cos C) 
which proves that the pairwise likelihood is 



(4) 



e 2 ( Xl , X2 \<p) = ^ T L=p(e)^-^, ( 5 ) 

with p{6) = 9/(1 + 9 + VTT29). 

Two genes from different denies We now consider two genes that come from two different 
demes that diverged from an ancestral deme at time r in the past. We denote the allelic state of the 
two ancestors at time r by x\ and x 2 , respectively. Then, X\ — x 2 — [x\ — x\) + [x\ — x 2 ) + (x 2 — x 2 ), 
where x\ — x 2 follows a distribution whose Fourier transform is given by Q, while (xj — x®), j = 1,2 
are iid., whose distribution is given by the difference of two allelic states separated by a fixed time r. 
This distribution is derived in Equation (3) of ([2]): 

P( Xj - x] =8)= e- T0 / 2 Is(T9/2), 

where Is denotes the <5th-order modified Bessel function of the first kind, given by (n > 0) 

(z/2) n+2k 



I„ n (z) = I n (z)=J2 



k\(n + k)\ 



k=0 

Using the independence between (xi — x®) and (xi — X®), we obtain 

P((H - xi) + [xi - x\) = S) = c- t6 I s (t9). (6) 

We then retrieve this distribution by computing Fourier transforms in the same vein as above. First, 
we note that the number TVi of mutations between x\ and its ancestor at time r is a Poisson variate 
with parameter t9/2. And, 

E [ e «(xS-*i)] = E |e e l ^ x "- Xl ^\N^ } = E ^JJ E(exp«e fc ))^ 



E [(cosC)^] = £e-^/ 2 ^^(cosC) 



n=0 



since Ni ~ Vo(t9/2) 
= exp{r6»(cosC-l)/2} . (7) 

Finally, the distribution of xi — x\ is a (discrete) convolution product between the distributions given 
by ^ and ([6| which yields 

_ T g +oo 

l 2 ( Xl:X2 \t) = ^== J2 p(0) W I\ Xl - X2 + k (r9). 

k— — oo 
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Quantile estimation 

Examples of quantile distributions are the three-, four- and five-parameter Tukey's lambda distribu- 
tions and their generalizations and the Burr family of distributions; particular examples include the 
g-and-h and g-and-fc distributions ([3j 01 [5j [7]) . 

Proposed methods for estimation of quantile distributions include maximum likelihood estimation 
using numerical approximations to the likelihood ([71 [51 IS]), moment matching (fTOI ITT]) , location and 
scale- free shape functionals (|12[), percentile matching quantile matching (13) and, more recently, 
ABC (jTTl [T51 [TBI . Sequential Monte Carlo approaches for multivariate extensions of the g-and-fc have 
also been proposed ([37]). 

There has been a number of ABC approaches proposed for this problem. For example, (fT5|) 
adopted the ABC-MCMC algorithm of (|T5[) . in which draws of 9 are based on a Metropolis algorithm 
with a Gaussian proposal distribution, and are accepted based on the rule p(S(D), S(D/)) < e), where 
D is the entire set of order statistics, p is the Euclidean norm and e is heuristically chosen after 
inspection of a histogram of p(S, St) obtained from a preliminary run using a very large value of e. 
This approach has recently been improved by (fl"6|) through more sophisticated MCMC approaches, the 
use of regression summary statistics for D based on percentiles and their powers, and more automated 
choices of e. However, they still maintain a form of distance-based measure p(S, S') in accepting 9. 

Normal estimation 

Figures S1-S3 evaluate the impact on the posterior distribution approximation of increasing the 
number of constraints in the empirical likelihood definition. Since this is a formal example, the true 
posterior distribution is available. 

Quantile estimation 

The BC i experiment involved evaluation of Algorithm 1 for estimation of the parameters of the 
g-and-fc distribution using the two values of 9 = (A,B,g,k), namely 9n = (0,1,0,0), which cor- 
responds to the standard normal distribution, and 9a — (3,2,1,0.5); which was chosen by ([3"6"[) as 
'an interesting, far-from-normal distribution. The simulation experiment comprised multiple repe- 
titions of BC c i using different combinations of sample size, n = (100,500), number of iterations, 
M = (1000, 5000, 10000), and number of constraints (p = 3, 4, 4, 5, 9), corresponding to percentile sets 
(0.2,0.5,0.8), (0.2,0.4,0.6,0.8), (0.1,0.4,0.6,0.9), (0.1,0.25,0.5,0.75,0.9) and (0.1, 0.2, 0.9). Two 
sets of priors were considered for (A, B,g, k): U[0, 5] 4 (denoted as Pi) and U{— 5, 5). {7(0, 5), U(— 5, 5), (—0.1, 1) 
(denoted as P2). Although the priors were set independently for each element of 9, the four elements 
were drawn together at each iteration of the algorithm, so that the same importance weight was 
attached to the values A^ l \ B^\ </W, feW drawn in the ith iteration. The experiment was replicated 
ten times with different draws of samples of size n. Posterior means and standard deviations were 
computed for each parameter, and the overall goodness of fit to the true curve was assessed by compar- 
ing the true quantiles at (0.05, 0.10, 0.95) with two measures: the estimated mean at each quantile 
(denoted by RSSm) and the average of the estimated quantile for each importance sample (RSSt). 

Boxplots of the posterior means and standard deviations are shown in Figure S4 and Figure S5, 
respectively, for the four parameters, based on 9a, prior P2 and the 20 replicates for M = (5000, 10000), 
for ten of the trials: p = 3,4, 4, 5, 9 for n — 100 (trials 1-5) and n = 500 (trials 6-10). Boxplots for the 
corresponding overall goodness of fit measures (RSSm, RSSt) are given in Figure S6. 

Superposition of point processes 

(|4"T[) discuss an alternative to ABC, using fractional design and linear interpolation. While their 
purpose is the non-Bayesian processing of models with intractable likelihood functions, they propose 
as their main example a model consisting in the superposition of N renewal processes with waiting 
times Tij (i = 1, . . . , M), j = 1, . . .) distributed as Q (a, (3) variables, when N is unknown. The renewal 
processes are thus 

Cil = T ili Ci2 — Cil + T i2, • • • 
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and the observations are made of the first n values of the Cij' s i 

z\ = min{Cy}, z 2 = min{C»j;Ctj > zi}, • • • ) 

ending with 

z n = min{Cy ; Cij > Zn-i} ■ 

This model offers an interesting testing ground for BC i in that the data points z t are neither iid nor 
Markov. It is however possible to recover and exploit an iid structure in this case by first simulating 
a pseudo-dataset, (z*, . . . , z*), as in ABC settings, and then deriving a sequence of renewal processes 
indicators (yi, . . . , u n ), as 

z l = Cvi 1 ! z 2 — Cv2j2 ! • ■ ■ i z n = Cv-njn ' 

These indicators are thus distributed from the prior distribution on the Vt& and an iid sample of 
Q(a,/3) variables can be derived from those indicators and the genuine data, leading to an associated 
empirical likelihood. As shown on Figure S7, when applied to a simulated dataset (as in (JHJ), the 
empirical likelihood approximation produces a better approximation than the corresponding ABC 
solution based on the same statistics as (|4"Tj) (for exactly the same computational cost). 

Time gains in population genetic models 

In general, the speed of executing an ABC algorithm depends on many factors, including: 

• one's ability to program an efficient simulator from the model distribution and to compute the 
selected summary statistics, 

• the choice of the threshold e 

• and the size of the Monte Carlo sample, denoted M, 
and the speed of BC e i depends on: 

• the difficulty to optimize under the constraints [1] (in the population genetics examples, this is 
not straightforward because of the Bessel functions and of the various series involved when the 
two individuals are not in the same deme), 

• and the size M of the Monte Carlo sample. 

While producing many simulations from the model distribution often is the stumbling block for ABC 
algorithms, the selection of the constraints [1] and the time requirements of the optimization step 
are both highly variable and delicate to quantify. While all the experiments described in this paper 
induced no inflation in computing time and mostly significant reductions, we cannot exclude the 
possibility of BC i requiring more computing time than ABC. 

Both population genetic experiments conducted in this paper analyse datasets with a large number 
of loci (one hundred). Thus ABC, which requires simulations of all loci to produce a simulated dataset, 
is quite time consuming and particularly so when the evolutionary scenario is more complex than the 
one in the first experiment. We compare here the computing times required by our implementation of 
the BC c i-AMIS sampler and by DIYABC © on an Intel Xeon W3680 Plateform with GNU/Linux. 
Both methods were parallelized over five among the six cores of this CPU with the OpenMP API. 
Table [2] exhibits computation time averages on ten replicates of the estimation. 
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Table 2: Computing times for DIYABC and BC c i in both population genetics experiments 

ABC BC e i 
Experiment (DIYABC software) (BC cl -AMIS code) 

1 21 min 24 sec 

2 16 hours 55 sec 
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Figure 6: Comparison of the true posterior on the normal mean (solid lines) with the 
empirical distribution of weighted simulations resulting from Algorithm BC c i. The normal 
sample sizes are 25 (column 1), 50 (column 2) and 75 (column 3), the number of simulated 
0's is 10 3 and the effective samples sizes M ESS are given on top of each histogram. The 
constraint is on the first moment of the dataset. 




Figure 7: Comparison of the true posterior on the normal mean (solid lines) with the 
empirical distribution of weighted simulations resulting from Algorithm BC c i. The normal 
sample sizes are 25 (column 1), 50 (column 2) and 75 (column 3), the number of simulated 
0's is 10 3 and the effective samples sizes M ESS are given on top of each histogram. The 
constraint is on the first two moments of the dataset. 
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Figure 8: Comparison of the true posterior on the normal mean (solid lines) with the 
empirical distribution of weighted simulations resulting from Algorithm BC c i. The normal 
sample sizes are 25 (column 1), 50 (column 2) and 75 (column 3), the number of simulated 
0's is 10 3 and the effective samples sizes M ESS are given on top of each histogram. The 
constraint is on the first three moments of the dataset. 




Figure 9: Boxplots of the variations of the posterior means of the four parameters of the 
g-and-k distribution, based on BC e i approximations, for n = 100 observations (1 to 5) and 
n = 500 observations (6 to 10), based on M = 10 4 simulations and 10 replications. The 
moment conditions used in the BC e i algorithm are quantiles of order (0.2,0.5,0.8) (1 and 
6), (0.2,0.4,0.6,0.8) (2 and 7), (0.1,0.4,0.6,0.9) (3 and 8), (0.1,0.25,0.5,0.75,0.9) (4 and 9), 
and (0.1, 0.2,.... 0.9) (5 and 10). 



20 



Figure 10: Same graph as Figure [9] for the standard deviations. 
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Figure 11: Goodness of fit measures (RSSm, RSSt) for the same experiment as in Figure p)| 
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Figure 12: Approximate posterior distributions of the parameters (a, j3, N) for the superpo- 
sition gamma process model, using a simulated dataset of 90 observations, with a = .5,(3 = .8, 
and N = 5: (top) BC e i output; (bottom) ABC output. 
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